Diverse evolutionary rates and gene duplication patterns among families of functional olfactory receptor genes in humans

In humans, odors are detected by ~400 functional olfactory receptor (OR) genes. The superfamily of functional OR genes can be further divided into tens of families. In large part, the OR genes have experienced extensive tandem duplications, which have led to gene gains and losses. However, whether different OR gene families have experienced distinct modes of gene duplication has yet to be reported. We conducted comparative genomic and evolutionary analyses for human functional OR genes. Based on analysis of human-mouse 1–1 orthologs, we found that human functional OR genes show higher-than-average evolutionary rates, and there are significant differences among families of functional OR genes. Via comparison with seven vertebrate outgroups, families of human functional OR genes show different extents of gene synteny conservation. Although the superfamily of human functional OR genes is enriched in tandem and proximal duplications, there are particular families which are enriched in segmental duplications. These findings suggest that human functional OR genes may be governed by different evolutionary mechanisms and that large-scale gene duplications have contributed to the early evolution of human functional OR genes.


Introduction
Humans perceive a variety of odors. These odorants are detected by olfactory receptors (ORs), a type of seven-transmembrane domain G protein-coupled receptors [1][2][3]. The OR genes are one of the largest families in the human genome, consisting of~400 protein-coding genes and another~400 pseudogenes [4][5][6][7]. The OR genes can be further classified into tens of families and hundreds of subfamilies [4][5][6]. It is presumed that each subfamily of functional OR genes is dedicated to the detection of a particular class of odorants, though this does not preclude the recognition of a particular class of odorants by different subfamilies [4].
OR genes are distributed at many loci across 21 chromosomes in the human genome. Most OR subfamilies reside in a single chromosomal locus, and many loci encode only one or a few subfamilies. This pattern of chromosomal distribution was presumed to be caused by extensive tandem gene duplications among OR genes [8,9]. However, there are also a substantial number of cases in which OR genes of the same subfamilies are located on different chromosomal regions, and OR genes of different subfamilies are sometimes located closely in a chromosomal region [4,6,9]. Comparison of different species shows that the number of OR genes vary extensively among different species, and many species have a substantial number of pseudogenes [10][11][12]. It has been suggested that OR genes have experienced extensive gains and losses during evolution, presumably to meet the specific environmental/dieting niche(s) of each species [10,[12][13][14][15][16]. These previous studies collectively suggest the complexity in the evolutionary mechanisms of OR genes.
In addition to single-gene duplications, such as tandem duplication, large-scale gene duplications, such as whole-genome duplication (WGD) and segmental duplication, have occurred in vertebrate evolution [17,18]. Genes created by different modes have often experienced different patterns of evolutionary tempos and gene gains and losses [19][20][21]. Moreover, gene relocations frequently occur during evolution [22][23][24]. However, little is known regarding the contributions of different gene duplication modes, especially large-scale gene duplication, to the evolution of OR genes.
The aim of this analysis is to better understand the evolutionary mechanisms of human functional OR genes via a variety of comparative genomic and evolutionary analyses. We have excluded pseudogenes as they are not protein-coding (i.e., not related to human odor detection).

Human functional OR genes are under a more relaxed purifying selection and are more affected by balancing selection
We compiled a list of 399 functional OR genes in the human genome (see Methods). These functional OR genes comprise 17 families. The sizes and members of the families of human functional OR genes are shown in S1 Table. The family size of the various human functional OR genes vary from 3 to 68 members. Detailed information of the human functional OR genes can be viewed in S2 Table. We first studied the evolutionary rates of human functional OR genes based on analysis of human-mouse 1-1 orthologs. The evolutionary rates are denoted by non-synonymous substitution rate (Ka), synonymous substitution rate (Ks), and their ratio (Ka/Ks) (i.e., the rate of nonsynonymous substitutions corrected for neutral rates). Comparisons of Ka, Ks and Ka/Ks distributions between functional OR genes and all genes in humans indicate that functional OR genes evolved relatively faster in the human genome (Fig 1). As Ka/Ks is also indicative of selection pressure, a trend of higher Ka/Ks for functional OR genes indicates that human functional OR genes are, in general, under a relaxed purifying selection ( Fig 1C). We next investigated the selection patterns of human functional OR genes in terms of deviation from neutrality. We computed Tajima's D [25] for each functional OR gene based on the African (AFR) and European (EUR) populations of the 1000 Genomes Project [26]. Comparisons of Tajima's D between functional OR genes and all genes in the human genome show that functional OR genes tend to have a higher Tajima's D than the genome average (Fig 2), indicating that functional OR genes tend to be more affected by balancing selection in the human genome. Collectively, the above analyses suggest that in the human genome, functional OR genes tend to be under a more relaxed purifying selection and are more affected by balancing selection, further suggesting that human functional OR genes tend to preserve higher functional diversity, which may render a great variability in human olfactory cognition of numerous environmental olfactory stimuli.

Human functional OR genes display diverse evolutionary rates among their families
We next investigated whether there are differences in the evolutionary rates among families of human functional OR genes. To this end, we compared Ka, Ka/Ks and Tajima's D among families of human functional OR genes. Comparisons of Ka and Ka/Ks were statistically significant based on a Kruskal-Wallis test (P = 1.06×10 −5 and 7.89×10 −3 , respectively) (Fig 3). We heuristically clustered families of human functional OR genes in terms of evolutionary rates and found that families 3, 7 and 14, whose sizes are relatively small (3, 11 and 6 members respectively) have a higher Ka (P = 1.42×10 −6 , Wilcoxon test) and Ka/Ks (P = 2.49×10 −4 , Wilcoxon test) than other families (Fig 3A and 3B). In contrast, the differences in Tajima's D among families of human functional OR genes were much more moderate (P = 2.25 ×10 −2 and 9.46 ×10 −2 for AFR and EUR, respectively) ( Fig 3C and 3D). These analyses suggest that families of human functional OR genes may differ in evolutionary rates, possibly pertaining to their different functions.

Different gene synteny patterns among families of human functional OR genes
During the evolution of genomes, genes may remain on corresponding chromosomes (synteny) and corresponding orders (collinearity). It is known that in the human genome, most OR genes form local clusters, presumed to be created by extensive tandem duplications [8,9]. However, the extent of different functional OR genes remaining at ancestral chromosomal  (i.e., syntenic) locations is poorly known. In addition, possible roles of synteny and collinearity among OR duplicated genes, presumed to be created by large-scale duplications, have been rarely reported.
We performed both intra-species and cross-species synteny and collinearity analyses across eight species, including human, gorilla, macaque, mouse, chicken, lizard, frog and zebrafish. The evolutionary relationships of these species are shown in Fig 4. Genes with fewer syntenic genes across genomes are presumed to have been created more recently, or have undergone more relocations, often associated with the reshuffling of chromosomal segments. For each human functional OR gene, we computed the number of syntenic genes in the outgroup species. We found that families of human functional OR genes significantly differ in the number of syntenic genes in the outgroup species (Fig 5, P = 1.42×10 −30 , Kruskal-Wallis test). Notably, families 51, 52 and 56 show much higher numbers of syntenic genes in the outgroups than the other families (P = 6.22×10 −3 , Wilcoxon test), which may indicate that these families are more likely to perform conserved functions across the evolution of vertebrates.
Analysis of the intra-species collinear genes can indicate that duplicated genes have undergone large-scale gene duplication events, such as whole-genome duplication (WGD) or segmental duplication. We, thus, next generated the gene duplication modes among human functional OR genes (S3 Table), including 18 pairs of collinear (i.e. WGD/segmental) duplicates, and the data suggest that large-scale duplications may have substantially contributed to the evolution of human functional OR genes. To intuitively view gene duplication modes, we plotted WGD/segmental duplications and tandem duplications based on genome circles for all human functional OR genes and each family (S1 Fig). We computed enrichment of gene duplication modes for each family (S4 Table). Expectedly, the superfamily of human functional OR genes is enriched for tandem and proximal gene duplications. However, families 1, 14 and 52 are enriched for WGD/segmental duplications. This analysis suggests that large-scale duplications have played an essential role in the expansion and retention of human functional OR genes, especially in a few families. We observed that the locations of duplicate genes created by WGD/segmental duplications often coincide with tandem duplications. As tandem duplications often arise by unequal crossing-over [27], this observation suggests that a possible mechanism for human functional OR gene expansion could be frequent unequal crossingover following large-scale gene duplications. We found that the Ks (a proxy for gene duplication ages) values of OR collinear duplicate pairs are significantly higher than OR tandem duplicate pairs (P = 2.70×10 −8 , Wilcoxon test. The above analyses collectively suggest that large-scale gene duplications have played an essential role during the evolutionary of human functional OR genes.

Discussion
We observed that OR gene families significantly differ in the number of syntenic genes in the outgroup species. One explanation is that for more recent OR gene families, they may tend to have fewer syntenic genes for each member. We understand that genome annotation quality may be different among different genomes and annotation versions. As we are comparing the number of syntenic genes among human functional OR families, all OR families should be affected by similar levels of uncertainties, which is not likely to cause invalidation of our observation.
Small-scale duplications, such as tandem and transposed duplications, often have a higher evaporating rate [20]. This is consistent with the fact that in the human genome there are a comparable number of OR pseudogenes to functional OR genes [10][11][12]. In addition to tandem duplications, small-scale gene duplications can be further distinguished by proximal duplications (i.e., near one another but separated by a few genes) and by dispersed duplications (i.e., neither adjacent to each other in the genome nor within homologous chromosome segments) [23,28]. Tandem duplication is likely to arise from unequal crossing-over while proximal duplication is seemingly caused by localized transposon activities [23,28]. Distant single gene transposition may explain part of the dispersed duplicates segments [23,28]. Considering that OR genes are enriched for tandem and proximal genes, unequal crossing-over could be a primary mechanism for OR gene expansion and possibly for a subsequent pseudogenization. In particular, large-scale gene duplications were important for the evolution of particular OR gene families. Unequal crossing-over following segmental duplications could be an important evolutionary mechanism for several OR gene families. As there are contrasting evolutionary patterns among taste receptor genes [29,30], we suggest that diverse evolutionary mechanisms are a common phenomenon for olfactory sensory genes.
In this study, Ka/Ks were compared at superfamily and family levels. In the future we will compute Ka/Ks for branches of phylogenetic trees of OR genes and their p-values/q-values, in order to determine whether positive or negative selection is experienced. We will also associate the adaptive roles of OR genes/subfamilies with duplication bursts.

Conclusions
In this study, we carried out comparative genomic and evolutionary analyses for human functional OR genes, and we were able to differentiate the specific evolutionary patterns for the various families of human functional OR genes. We found that functional OR genes show higher-than-average evolutionary rates in the human genome, but there are significant differences among families. We found evidence that suggests that human functional OR genes are more likely to be affected by balancing selection, though they show no obvious differences among families. Human functional OR gene families, including 1, 14 and 52, are enriched for WGD/segmental duplications. This study suggests that human functional OR genes may be governed by different evolutionary mechanisms and that large-scale gene duplications have played an essential role for the expansion and the retention of human functional OR genes.

Functional OR genes and gene families in humans
We obtained the list of human functional OR genes from Ensembl (https://uswest.ensembl. org/index.html) by searching the human genome for the keywords "olfactory and receptor", restricting the returned result within the "gene" category and then removing pseudogenes by a custom Perl program.
The description of OR genes from Ensembl followed a convention "olfactory receptor family X subfamily X member X". Thus, families of human functional OR genes were generated according to their gene description. For example, family 1 of human functional OR genes contain those OR genes with the keyword "family 1" in their description and without the keyword "pseudogene".
The protein sequences, coding sequences and coordinates of human functional OR genes were also retrieved from Ensembl.

Detection of gene synteny and collinearity
Protein sequences, CDS sequences in FASTA format and gene positions of the human genomes and seven outgroup genomes, including gorilla, macaque, mouse, chicken, lizard, frog and zebrafish, were retrieved from Ensembl. Versions of genome assembly are listed in S5 Table. For any genes that had more than one transcript, only the longest transcript was included in the annotation.
To search for homology between human and outgroup genomes, we conducted an all-vs-all BLASTP for each outgroup genome against human, and human against each outgroup genome, respectively, with an e-value cut-off of e -10 and the top five hits kept in each target genome. We also performed BLASTP for the human genome against itself with the same parameter setting and kept the top five non-self-hits for each gene.
To identify syntenic blocks between genomes, we concatenated all the above inter-/intraspecies m6 BLASTP outputs into a.blast file and concatenated all gene positions of different genomes into a.gff file. Then we searched the syntenic blocks and collinear gene pairs using the software MCScanX [31] with the following default parameters: a match score of 50, gap penalty of −1, E-value of e −5 , maximum gap size between any two consecutive protein pairs of 25 and at least five consecutive proteins to define a syntenic region. Syntenic blocks within the human genome were also generated using MCScanX with the same parameter setting.

Analysis of duplicate gene origins
Origins of genes of a genome can be classified as singletons, whole genome/segmental (i.e., collinear genes in collinear blocks), tandem (consecutive repeat), proximal (in nearby chromosomal region but not adjacent) or dispersed (other modes than segmental, tandem, and proximal) duplications, depending on their copy number and genomic distribution. We used duplicate_gene_classifier from MCScanX [31] to classify origins of duplicate genes into the five classes for the human genome.
To detect gene duplications and their modes for the families of human functional OR genes, we first performed MCScanX-transposed [32], a software package based on execution of MCScanX, for the human genome, with the other seven genomes as outgroups. The MCScanX-transposed output was directly used as input for the downstream analysis program, detect_dup_modes_for_a_family, to detect duplicate gene pairs of different modes for the superfamily and families of human functional OR genes respectively.
We used origin_enrichment_analysis from MCScanX to identify potential enrichment of duplicate gene origins based on the result of duplicate gene origins. The P-value was calculated based on the null hypothesis that there was no association between the members of a gene family and a particular gene duplication mode and was corrected with the total number of duplication modes for multiple comparisons (i.e., Bonferroni correction).

Ka and Ks calculation
Human and mouse 1-to-1 orthologous genes were retrieved from the vertebrate homology database (http://www.informatics.jax.org/homology.shtml). More specifically, we searched the "Mouse/Human Orthology with Phenotype Annotations" dataset (https://www.informatics. jax.org/downloads/reports/HMD_HumanPhenotype.rpt) for 1-1 orthologs. Then the OR gene names in the mouse genome were manually converted to gene names in Ensembl using the cross-reference links on their MGI (i.e. Mouse Genome Informatics, https://www. informatics.jax.org/) page.
We generated Ka and Ks values for duplicate gene pairs and orthologous gene pairs using the Yang and Nielsen method [33], available in the yn00 module of the PAML package [34]. We excluded the gene pairs with Ka or Ks < 0 from analysis. The Ka/Ks ratio was calculated by retrieving the 'omega' value of the MLmatrix.

Circle plots
Circle plots displaying gene duplication modes were drawn using the Circos software [36].

Statistical analysis
Nonparametric tests were used for significance analysis due to non-normal distributions for the compared metrics. Comparisons between human functional OR genes and all genes were conducted by a Wilcoxon test. Comparisons among families of human functional OR genes were conducted by a Kruskal-Wallis test (one-way ANOVA on ranks).